04-Data Visualization (continued)

Today we’ll be working with the diamonds dataset from the ggplot2 package. We want to understand how various features of a diamond influence its price.

Let’s load the ggplot2 package and the diamonds dataset. (Install the package with install.packages("ggplot2") if you have not done so yet.) Look at the documentation to understand what the dataset is about.

library(ggplot2)
data(diamonds)
?diamonds

As usual, we can use str(), head() or View() to see the dataset:

str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Practice

Let’s practice some of the basic plotting that we learnt last session! (Note: Some of these plots may take a while to load as our dataset is quite big.)

Make a histogram of price. Vary the number of bins to see what happens.
Make a scatterplot of price vs. carat. Adjust the alpha to 0.05 to reduce overplotting. Do you see any patterns in the data?
Make a boxplot of price for each value of cut, then make a violin plot instead. How do these plots differ in the information that they give the reader?
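One possible way to set up these three exercises is sketched below. The object names (p_hist, p_scatter, p_box, p_violin) and the choice of 50 bins are illustrative assumptions, not part of the exercise; try other values yourself.

```r
library(ggplot2)

# Histogram of price; `bins` sets the number of bars (50 is an arbitrary choice)
p_hist <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(bins = 50)

# Scatterplot of price vs. carat, with transparency to reduce overplotting
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.05)

# Boxplot and violin plot of price for each value of cut
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
    geom_boxplot()
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
    geom_violin()

# Print an object to draw it, e.g. the histogram
print(p_hist)
```

Printing p_box and p_violin one after the other makes it easier to compare what each plot shows about the distribution of price.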

Bar plots

Bar plots are useful in describing how often each category appears for a categorical variable. The code below makes a bar plot to show how many diamonds there are for each cut type:

ggplot(data = diamonds, mapping = aes(x = cut)) +
    geom_bar()

Note that for shorter syntax, we can drop data = in ggplot() if our dataset is the first argument within the parentheses. We can also drop mapping = if (i) it is the second argument within the parentheses for ggplot(), or (ii) it is the first argument within the parentheses for the geom_xx() functions. For example, the code below will give the same plot:

ggplot(diamonds, aes(x = cut)) +
    geom_bar()
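Case (ii) above, where the mapping is given to the geom rather than to ggplot(), can be sketched as follows. The names p_geom and p_full are hypothetical and only there so the two versions can be compared.

```r
library(ggplot2)

# Mapping supplied to the geom: aes() is the first argument of geom_bar(),
# so `mapping =` can be dropped here as well
p_geom <- ggplot(diamonds) +
    geom_bar(aes(x = cut))

# Fully spelled-out equivalent
p_full <- ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))

print(p_geom)

# The bar heights are simply the number of rows for each cut
table(diamonds$cut)
```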

To make the bars horizontal instead, we can add coord_flip():

ggplot(diamonds, aes(x = cut)) +
    geom_bar() +
    coord_flip()

Layers

Layering allows us to make more sophisticated and informative plots. Let’s go back to the scatterplot of price vs. carat:

ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.05)

There is clearly a positive relationship between the two, although there is a lot of noise as well. We can add a geom_smooth() layer that tries to smooth out the noise:

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
    geom_point(alpha = 0.05) +
    geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The heavier the diamond, the more expensive it tends to be. At the same time, we see quite a wide spread of prices for diamonds of the same weight, indicating that there are probably other factors at play.
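The spread of prices at a fixed weight can also be checked numerically. The sketch below looks only at diamonds of exactly 1 carat, which is an arbitrary choice for illustration; one_carat is a hypothetical variable name.

```r
library(ggplot2)

# Prices of all diamonds that weigh exactly 1 carat
one_carat <- diamonds$price[diamonds$carat == 1]

# How many such diamonds there are, and how widely their prices vary
length(one_carat)
summary(one_carat)

# Overall linear association between carat and price
cor(diamonds$carat, diamonds$price)
```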

Relationships between 3 or more variables: scales and facets

Let’s go back to the boxplot of price for each value of cut:

ggplot(diamonds, aes(x = cut, y = price)) +
    geom_boxplot()

The boxplot suggests that cut has little effect on price, and that diamonds of ideal cut even tend to have lower prices, which seems counterintuitive. Could there be other factors at work? One possibility is that there just aren’t many large diamonds of ideal cut: thus, a diamond of ideal cut tends to weigh less (smaller in carat size), and hence fetches a lower price.
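A quick numeric check of this idea is to compare the median carat and median price within each cut. This is only a rough sketch using base R; med_carat and med_price are hypothetical names. If the theory holds, the ideal cut should show a smaller median carat than the fair cut.

```r
library(ggplot2)

# Median weight and median price for each level of cut
med_carat <- tapply(diamonds$carat, diamonds$cut, median)
med_price <- tapply(diamonds$price, diamonds$cut, median)

# Show the two summaries side by side, one row per cut
cbind(median_carat = med_carat, median_price = med_price)
```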

We can explore this theory by modifying other aesthetics. For example, in the scatterplot of price vs. carat, we can let the color of each dot signify its cut:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2)

There seem to be more yellow dots (better cuts) on top and more purple dots (poorer cuts) below, lending credence to the intuitive assumption that, for a given carat weight, a better cut fetches a higher price. In this case, changing the color of the dots helped us to understand the data better.

The colors here are the ggplot2 defaults. We can introduce our own color scale with scale_color_brewer() to make the plot more informative (the full list of color palettes can be found through a Google image search):

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_color_brewer(palette = "YlOrRd")
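The palette names can also be listed from within R. The sketch below assumes the RColorBrewer package is available; it is normally installed as a dependency of ggplot2, but that is an assumption about your setup. pal_info is a hypothetical variable name.

```r
library(RColorBrewer)

# Data frame describing every ColorBrewer palette (one row per palette)
pal_info <- brewer.pal.info

# Names of the sequential palettes, which suit ordered variables such as cut
rownames(pal_info)[pal_info$category == "seq"]

# The five colors that "YlOrRd" provides for a five-level variable
brewer.pal(5, "YlOrRd")

# Draw a chart of all palettes
display.brewer.all()
```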

There’s still a fair amount of overplotting going on. Can we have separate graphs of price vs. carat for each cut?

This is called splitting the plot into facets. R allows us to do this by using the function facet_wrap(). Use the following code to facet the plot by a single variable:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    facet_wrap(~ cut)

By default, R puts just 3 subplots in each row. We can change this by adding an nrow argument to facet_wrap():

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    facet_wrap(~ cut, nrow = 1)

Faceting didn’t help too much in this case, since the plots for the better cuts look very similar to one another. Perhaps we could add a smoothing layer to the original plot:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

As you can probably see, the possibilities are endless! You can try plotting different variables against each other and see if you get anything interesting.

If we want to facet by more than 1 variable, we can do so with facet_grid(). The variable before the ~ sign will be split on the rows, while the variable after the ~ sign will be split on the columns:

ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2) +
    facet_grid(cut ~ color)
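If the formula order is hard to remember, facet_grid() also accepts explicit rows and cols arguments wrapped in vars() (available in ggplot2 3.0.0 and later). The sketch below should give the same layout as the formula version; p_formula and p_vars are hypothetical names.

```r
library(ggplot2)

base_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2)

# Formula interface: rows ~ columns
p_formula <- base_plot + facet_grid(cut ~ color)

# Explicit interface: name the row and column variables directly
p_vars <- base_plot + facet_grid(rows = vars(cut), cols = vars(color))

print(p_vars)
```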

Themes and non-data ink

Let’s say you’re satisfied with the scatterplot of price vs. carat with color denoting cut, and that you want to share it with others. The first thing you should do is label your axes and give your plot a title:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")

The labels look a bit small. We can adjust their size using the theme() function. Let’s center the plot title at the same time:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") + 
    theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
          axis.title = element_text(size = rel(1.2)))

We can move the legend around by setting a legend.position argument in theme() (possible options are “none”, “left”, “right”, “bottom”, “top”):

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") + 
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
          axis.title = element_text(size = rel(1.5)),
          legend.position = "bottom")

Just about everything in the plot can be modified. For a full (long!) list of attributes which can be modified, see this reference.

We can also try changing the overall theme of the plot and see if any of them make the visualization more effective:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") + 
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
          axis.title = element_text(size = rel(1.5)),
          legend.position = "bottom") +
    theme_bw()

Notice how the legend is not at the bottom and that the plot and axis titles are back to the defaults? This is because we applied theme_bw() last. When we apply theme_bw(), it overwrites all the changes to the theme that we specified in theme(). To avoid this overwrite, we can simply reorder the code:

ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") + 
    theme_bw() +
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
          axis.title = element_text(size = rel(1.5)),
          legend.position = "bottom")

For a list of complete themes, see this link.
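The effect of the ordering can be seen without drawing anything, by adding the theme objects to each other and inspecting one setting. This is a small sketch; custom, custom_then_bw and bw_then_custom are hypothetical names.

```r
library(ggplot2)

# A partial theme that only changes the legend position
custom <- theme(legend.position = "bottom")

# Complete theme applied last: it replaces our setting
custom_then_bw <- custom + theme_bw()

# Complete theme applied first: our setting is layered on top of it
bw_then_custom <- theme_bw() + custom

custom_then_bw$legend.position
bw_then_custom$legend.position
```

Only the second ordering should report "bottom", which matches what we saw in the plots above.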

Optional material

Assign plots to variables

It seems tedious to be changing these attributes for each graph we make. The nice thing about ggplot is that it lets us assign each part of the plot to a variable! For example, we could have reproduced the plot above using this code:

p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
    geom_point(alpha = 0.2) +
    scale_colour_brewer(palette = "YlOrRd") +
    labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")
th <- theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
            axis.title = element_text(size = rel(1.2)),
            legend.position = "bottom")
p  # plot without the theme changes

p + th

We can now apply these adjustments to any plot we want by adding + th at the end of the code:

ggplot(data = diamonds) +
    geom_histogram(mapping = aes(x = price)) + 
    labs(title = "Histogram of price", x = "Price", y = "Count") +
    th

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

More on color scales

The ColorBrewer scale functions in ggplot2 are of the form scale_x_y, where x is either color or fill, and y is either brewer or distiller. color is for the outline while fill is for the interior; brewer is for a discrete set of colors while distiller is for continuous scales. For example, if we wanted to color the points based on price, we would use scale_color_distiller():

ggplot(diamonds, aes(x = carat, y = price, col = price)) +
    geom_point(alpha = 0.2) +
    scale_color_distiller(palette = "YlOrRd")
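For comparison, here is a sketch of the fill and brewer combination: a bar plot whose bar interiors are colored by cut, a discrete variable. The name p_fill is hypothetical.

```r
library(ggplot2)

# fill controls the interior of the bars; cut is discrete, so we use
# the brewer variant of the scale rather than distiller
p_fill <- ggplot(diamonds, aes(x = cut, fill = cut)) +
    geom_bar() +
    scale_fill_brewer(palette = "YlOrRd")

print(p_fill)
```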

Some of you might have picked up on an inconsistency here: didn’t I say that scale_color_distiller is for the outline of the shape? Why then is the fill of the points in the plot above changing?

This has to do with the shape aesthetic. R has 26 built-in point shapes, numbered 0 to 25.

Shapes 0-20 only have the col attribute while shapes 21-25 have both col and fill attributes. We can see this in action when we change the shape aesthetic in the previous plot:

ggplot(diamonds, aes(x = carat, y = price, col = price)) +
    geom_point(alpha = 0.2, shape = 21) +
    scale_color_distiller(palette = "YlOrRd")

It looks almost the same as before but if you look closely, the points have no fill in the interior. The code below makes the fill black:

ggplot(diamonds, aes(x = carat, y = price, col = price)) +
    geom_point(alpha = 0.2, shape = 21, fill = "black") +
    scale_color_distiller(palette = "YlOrRd")
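To see all 26 shapes and which of them accept a fill, we can draw them in a grid. This is a rough sketch: the grid layout (7 shapes per row), the column names col_pos and row_pos, and the red fill are arbitrary choices for illustration. Only shapes 21 to 25 should show the red interior.

```r
library(ggplot2)

# One row per shape, laid out on a grid with 7 shapes per row
shapes <- data.frame(shape = 0:25)
shapes$col_pos <- shapes$shape %% 7
shapes$row_pos <- shapes$shape %/% 7

p_shapes <- ggplot(shapes, aes(x = col_pos, y = -row_pos)) +
    # black outline for every shape; the fill only applies to shapes 21-25
    geom_point(aes(shape = shape), size = 5, colour = "black", fill = "red") +
    # label each point with its shape number
    geom_text(aes(label = shape), nudge_y = -0.3) +
    # use the numbers in the shape column as the shapes themselves
    scale_shape_identity() +
    theme_void()

print(p_shapes)
```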

Changing the x- and y-axis scales

Sometimes we may want to zoom in on a particular part of the plot. For example, look at the scatterplot of carat vs. z:

ggplot(diamonds, aes(x = z, y = carat)) +
    geom_point(alpha = 0.2)

While the default plot shows us all the data, most of the plot is wasted space to accommodate a single outlier. The following code allows us to define the limits of the x-axis (from 0 to 8.5):

ggplot(diamonds, aes(x = z, y = carat)) +
    geom_point(alpha = 0.2) +
    scale_x_continuous(limits = c(0, 8.5))

## Warning: Removed 1 rows containing missing values (geom_point).

R helpfully warns us that the one outlier was removed before plotting.
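We can find the row that was dropped by applying the same limits to the data ourselves. A minimal sketch, where outside is a hypothetical variable name:

```r
library(ggplot2)

# Rows whose z value falls outside the limits given to scale_x_continuous()
outside <- diamonds$z < 0 | diamonds$z > 8.5

# Number of rows outside the limits (should agree with the warning)
sum(outside)

# Inspect the offending row(s)
diamonds[outside, c("carat", "z", "price")]
```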

Instead of using scale_x_continuous(), we could also use coord_cartesian() to achieve the same effect:

ggplot(diamonds, aes(x = z, y = carat)) +
    geom_point(alpha = 0.2) +
    coord_cartesian(xlim = c(0, 8.5))

Notice that in this case, R does not warn us about the outlier. That is because the two functions work differently. With scale_x_continuous(), R removes all points outside the limits, then plots the remaining points. With coord_cartesian(), R plots all the points, then zooms in on the specified range. This difference might not seem like a big deal but it can make a difference in some cases. For example, the code below draws a jagged line:

n <- 15
df <- data.frame(x = cos(2 * pi * 1:n / n),
                 y = sin(2 * pi * 1:n / n))
ggplot(df, aes(x = x, y = y)) +
    geom_line()

If we only want to zoom in on the part above the x-axis, coord_cartesian() does the right thing:

ggplot(df, aes(x = x, y = y)) +
    geom_line() +
    coord_cartesian(ylim = c(0, 1))

scale_y_continuous(), on the other hand, treats the points outside the limits as missing before the line is drawn, so the line gets broken up. That’s probably not what we want in this case.

ggplot(df, aes(x = x, y = y)) +
    geom_line() +
    scale_y_continuous(limits = c(0, 1))

## Warning: Removed 2 rows containing missing values (geom_path).
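To get a feel for what the scale is doing, we can mark the out-of-range points as missing ourselves. This is only an approximation of the scale’s behavior, written by hand for illustration; outside and df_censored are hypothetical names.

```r
n <- 15
df <- data.frame(x = cos(2 * pi * 1:n / n),
                 y = sin(2 * pi * 1:n / n))

# Points whose y value lies outside the limits c(0, 1)
outside <- df$y < 0 | df$y > 1
df[outside, ]

# Hand-made copy of the data with those y values set to missing,
# roughly what the line layer receives after the limits are applied
df_censored <- df
df_censored$y[outside] <- NA
df_censored[order(df_censored$x), ]
```

Sorting by x, as geom_line() does, shows missing values sitting between the remaining points, which is why the line cannot be drawn continuously.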

Session info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         plyr_1.8.4         RColorBrewer_1.1-2
##  [4] pillar_1.4.2       compiler_3.6.1     tools_3.6.1       
##  [7] zeallot_0.1.0      digest_0.6.20      viridisLite_0.3.0 
## [10] lattice_0.20-38    nlme_3.1-140       evaluate_0.14     
## [13] tibble_2.1.3       gtable_0.3.0       mgcv_1.8-28       
## [16] pkgconfig_2.0.2    rlang_0.4.0        Matrix_1.2-17     
## [19] cli_1.1.0          yaml_2.2.0         xfun_0.9          
## [22] withr_2.1.2        dplyr_0.8.3        stringr_1.4.0     
## [25] knitr_1.24         vctrs_0.2.0        grid_3.6.1        
## [28] tidyselect_0.2.5   glue_1.3.1         R6_2.4.0          
## [31] fansi_0.4.0        rmarkdown_1.15     reshape2_1.4.3    
## [34] purrr_0.3.2        magrittr_1.5       splines_3.6.1     
## [37] scales_1.0.0       backports_1.1.4    htmltools_0.3.6   
## [40] assertthat_0.2.1   colorspace_1.4-1   labeling_0.3      
## [43] utf8_1.1.4         stringi_1.4.3      lazyeval_0.2.2    
## [46] munsell_0.5.0      crayon_1.3.4